In this file, we begin to consider relationships between variables. We have two basic types of questions that we’re interested in:
Given the negative values in the disbursal_timediff variable, however, we discussed cutting the supply chain in two and examining the “posting-to-funding” component separately from the “disbursing-to-repaying” component.
We also generated some data partitioning ideas at the end of our single-variable EDA, namely:
Let’s load the data and get started.
library(jsonlite)
library(ggplot2)
library(psych)
library(RColorBrewer)
library(countrycode)
loans = jsonlite::fromJSON("newpaid.json")
loans$posted_datetime = strptime(loans$posted_datetime, format = "%Y-%m-%d %H:%M:%S")
loans$funded_datetime = strptime(loans$funded_datetime, format = "%Y-%m-%d %H:%M:%S")
loans$disbursal_datetime = strptime(loans$disbursal_datetime, format = "%Y-%m-%d %H:%M:%S")
loans$paid_datetime = strptime(loans$paid_datetime, format = "%Y-%m-%d %H:%M:%S")
We know that we have no missing data, so we’ll dive right in.
First, we’ll generate a scatterplot matrix for the numeric loan feature variables:
pairs.panels(loans[, c(8, 9, 17, 18)])
Most pairs of variables are not correlated. There is no relationship between the length of a loan’s description and its amount, the number of borrowers, or the gender composition of its borrowers, nor is the gender composition of the borrowers related to the amount or number of borrowers. (The apparent structure in the number-female plot is merely an artifact of integer denominators.) The only correlation is between the loan amount and the number of borrowers. Although many loans, regardless of amount, have a single borrower – perhaps a representative for a larger group – the general trend is that a larger loan means more borrowers.
We see a few high-leverage points, where the loan amount is greater than 5000 USD, but these do not seem to be throwing off the overall trends.
Next, we want to consider categorical features of the loan, country and sector, in relation to our other loan variables. Let’s look at sector and loan amount first:
ggplot(data=loans, aes(x=sector, y=terms.loan_amount, colour=sector)) +
geom_boxplot() + scale_colour_discrete(guide=FALSE) +
xlab("Sector") + ylab("Loan amount") +
theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5,size=6))
For the bulk of the loans, there isn’t much difference between sectors: the mass is concentrated below about 1500 USD. Sector distinctions seem to matter more when we look at the tails. The education, food, manufacturing, and service sectors all have multiple loans that extend well beyond 1500 USD. We can examine this segment of loans values by itself:
large_loans = loans[loans$terms.loan_amount >= 1500, ]
ggplot(data=large_loans, aes(x=sector, y=terms.loan_amount, colour=sector)) +
geom_boxplot() + scale_colour_discrete(guide=FALSE) +
xlab("Sector") + ylab("Loan amount") +
theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5,size=6))
Even for this segment, though, the median and IQR’s by sector are comparable. Perhaps segmenting by loan amount won’t prove interesting after all.
How does the sector relate to the number of borrowers? Loans with one borrower will dominate the plot, so let’s look only at loans with more than one borrowers:
ggplot(data=loans[loans$b.num>1, ], aes(x=sector, y=b.num, colour=sector)) +
geom_boxplot() + scale_colour_discrete(guide=FALSE) +
xlab("Sector") + ylab("Number of borrowers") +
theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5,size=6))
Again, the ranges are quite comparable. Entertainment sector is the only real stand-out.
Is there a relationship between gender and sector?
ggplot(data=loans, aes(x=sector, y=b.female, colour=sector)) +
geom_boxplot() + scale_colour_discrete(guide=FALSE) +
xlab("Sector") + ylab("Fraction of females in borrowing group") +
theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5,size=6))
Yes, there are certain sectors where the borrowers are almost exclusive female: arts, clothing, food, and retail. This is the exception, however. All in all, sector doesn’t seem to reveal much when it comes to the characteristics of a loan. We can quantify its effect on the loan amount:
summary.lm(aov(terms.loan_amount ~ sector, loans))
##
## Call:
## aov(formula = terms.loan_amount ~ sector, data = loans)
##
## Residuals:
## Min 1Q Median 3Q Max
## -862.9 -430.4 -192.4 187.1 9282.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 716.6243 13.7199 52.232 < 2e-16 ***
## sectorArts 95.4826 42.6292 2.240 0.02512 *
## sectorClothing 152.0449 27.3213 5.565 2.67e-08 ***
## sectorConstruction 98.5481 44.4357 2.218 0.02659 *
## sectorEducation 163.6535 77.0978 2.123 0.03380 *
## sectorEntertainment 133.3757 180.4572 0.739 0.45986
## sectorFood 0.7903 18.1300 0.044 0.96523
## sectorHealth 182.1045 67.6630 2.691 0.00713 **
## sectorHousing -8.4243 39.6186 -0.213 0.83162
## sectorManufacturing 98.8635 52.1075 1.897 0.05781 .
## sectorPersonal Use 46.9174 74.7284 0.628 0.53012
## sectorRetail 17.2302 18.6082 0.926 0.35449
## sectorServices 171.2642 25.7245 6.658 2.89e-11 ***
## sectorTransportation -36.1917 35.0791 -1.032 0.30222
## sectorWholesale 31.9051 124.1945 0.257 0.79726
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 719.7 on 13764 degrees of freedom
## Multiple R-squared: 0.007289, Adjusted R-squared: 0.006279
## F-statistic: 7.219 on 14 and 13764 DF, p-value: 3.424e-15
The very low adjusted R-squared means a negligible effect size.
What about country? Does it help explain the characteristics of a loan?
ggplot(data=loans, aes(x=location.country_code, y=terms.loan_amount,
colour=location.country_code)) + geom_boxplot() +
scale_colour_discrete(guide=FALSE) +
xlab("Country") + ylab("Loan amount") +
theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5, size=6))
Here, we see a significant variance in loan amount, within and across countries. Some countries have a mean loan amount of over 3000 USD (the U.S., notably), even as many have a mean of under 1000 USD. Even for these latter countries, however, there can be a large number of loans in the upper quartile. (These are the dots lying outside the boxplots.)
Let’s remove countries that don’t show up much in our sample before analyzing how country relates to other loans variables.
high_countries = table(loans$location.country_code) >= 30
hc_vec = high_countries[loans$location.country_code] == TRUE
hc_loans = loans[hc_vec, ]
An ANOVA shows that country does a better job of explaining the variance in loan amount:
summary.lm(aov(terms.loan_amount ~ location.country_code, hc_loans))
##
## Call:
## aov(formula = terms.loan_amount ~ location.country_code, data = hc_loans)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1651.1 -269.8 -97.5 151.5 4348.2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 704.78 70.69 9.970 < 2e-16 ***
## location.country_codeAM 1246.28 110.58 11.271 < 2e-16 ***
## location.country_codeAZ 531.89 79.50 6.691 2.31e-11 ***
## location.country_codeBJ -213.33 89.32 -2.388 0.016935 *
## location.country_codeBO 563.67 76.85 7.335 2.35e-13 ***
## location.country_codeCM -153.35 121.27 -1.265 0.206054
## location.country_codeCO -13.18 88.62 -0.149 0.881743
## location.country_codeCR 231.38 105.19 2.200 0.027851 *
## location.country_codeDO 870.02 87.72 9.919 < 2e-16 ***
## location.country_codeEC 115.05 76.80 1.498 0.134162
## location.country_codeGH -135.53 76.17 -1.779 0.075236 .
## location.country_codeGT 288.79 99.26 2.910 0.003625 **
## location.country_codeHN -127.86 88.89 -1.438 0.150358
## location.country_codeID -187.85 92.40 -2.033 0.042059 *
## location.country_codeIQ 1545.22 118.07 13.088 < 2e-16 ***
## location.country_codeKE -231.28 73.51 -3.146 0.001659 **
## location.country_codeKG 626.35 90.57 6.916 4.87e-12 ***
## location.country_codeKH 64.92 72.92 0.890 0.373341
## location.country_codeLB 549.71 81.01 6.785 1.21e-11 ***
## location.country_codeLR -260.60 98.25 -2.653 0.007998 **
## location.country_codeML 132.63 86.07 1.541 0.123326
## location.country_codeMN 591.03 81.99 7.209 5.95e-13 ***
## location.country_codeMX 207.07 76.45 2.709 0.006767 **
## location.country_codeMZ -190.80 103.71 -1.840 0.065845 .
## location.country_codeNG -95.35 77.40 -1.232 0.218008
## location.country_codeNI -157.57 73.74 -2.137 0.032638 *
## location.country_codeNP -468.35 121.27 -3.862 0.000113 ***
## location.country_codePE -77.97 72.11 -1.081 0.279602
## location.country_codePH -403.84 71.89 -5.618 1.97e-08 ***
## location.country_codePK 182.14 82.10 2.219 0.026527 *
## location.country_codePS 1165.03 106.81 10.907 < 2e-16 ***
## location.country_codePY 668.48 81.06 8.247 < 2e-16 ***
## location.country_codeQS -332.23 84.96 -3.910 9.26e-05 ***
## location.country_codeRW 177.18 84.55 2.096 0.036127 *
## location.country_codeSL 126.58 84.00 1.507 0.131865
## location.country_codeSN 338.81 82.32 4.116 3.88e-05 ***
## location.country_codeSV -46.79 80.05 -0.585 0.558848
## location.country_codeTG 95.73 79.94 1.198 0.231092
## location.country_codeTJ 236.75 75.35 3.142 0.001682 **
## location.country_codeTZ 56.71 78.67 0.721 0.471016
## location.country_codeUA 541.20 94.36 5.736 9.92e-09 ***
## location.country_codeUG 413.66 75.53 5.477 4.41e-08 ***
## location.country_codeVN 80.09 82.10 0.976 0.329287
## location.country_codeWS -81.76 84.08 -0.972 0.330852
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 582.9 on 13502 degrees of freedom
## Multiple R-squared: 0.2405, Adjusted R-squared: 0.2381
## F-statistic: 99.45 on 43 and 13502 DF, p-value: < 2.2e-16
The adjusted R-squared is 0.23, much better than the same statistic for sector. (If we include all countries, it climbs to 0.32.)
How does country relate to number of borrowers?
ggplot(data=hc_loans, aes(x=location.country_code, y=b.num,
colour=location.country_code)) + geom_boxplot() +
scale_colour_discrete(guide=FALSE) +
xlab("Country") + ylab("Number of borrowers") +
theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5, size=6))
Again, loans with one borrower dominate, yet there are some countries where the number of borrowers tends to range more widely, such as Paraguay (PY).
Does country relate to the gender composition of the borrowers?
ggplot(data=hc_loans, aes(x=location.country_code, y=b.female,
colour=location.country_code)) + geom_boxplot() +
scale_colour_discrete(guide=FALSE) +
xlab("Country") + ylab("Fraction of females in borrower group") +
theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5, size=6))
In many countries, the loans are domainated by female borrower groups, but in many others, there are male groups as well. The most interesting feature of the above plot is probably the countries where there are a large number of mixed-gender borrower groups: Cambodia (KH), Peru (PE), The Philippines (PH), and Tanzania (TZ).
Finally, we can look at how our two categorical variables, country and sector, relate to one another. In our univariate EDA, we saw that agriculture, food, and retail were the main sector represented in the loan data. Is this true for each country?
getPalette = colorRampPalette(brewer.pal(11, "Spectral"))
num_sectors = length(unique(hc_loans$sector))
ggplot(data=hc_loans, aes(x=location.country_code, fill=sector)) +
geom_histogram(position="fill") +
scale_fill_manual(values = getPalette(num_sectors)) +
xlab("Country") + ylab("Frequency") +
theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5,size=6))
Not exactly. Those are the three largest sectors, but their relative frequencies differ from country to country. In some countries, agriculture accounts for well over 50% of loans, or even 100% of loans; in others, agriculture loans are outnumbered by food and retail loans.
The number of sectors can make it hard to be sure of these frequencies, so let’s look at the same plot but only for these top three sectors:
top_sectors = c("Agriculture", "Food", "Retail")
top_sector_loans = hc_loans[hc_loans$sector %in% top_sectors, ]
ggplot(data=top_sector_loans, aes(x=location.country_code, fill=sector)) +
geom_histogram(position="fill") +
scale_fill_manual(values = getPalette(3)) +
xlab("Country") + ylab("Frequency") +
theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5,size=6))
Now the distinctions are clear. This suggests that when we move on to multivariate analysis, we try conditioning on the borrower’s country before examining how sector relates to loan properties.
We have not yet looked at time series. Let’s do that now. We note that loans are not evenly spaced in time, so any time series analysis would quickly land us in complex territory. However, we can get a sense of the overall activity by looking the loan amount at the posted_datetime variable:
ggplot(data=loans, aes(x=posted_datetime, y=terms.loan_amount)) + geom_line() +
xlab("Time") + ylab("Loan Amount") + theme_bw()
The sporadic early activity suggests we not use dates early than 2007. Let’s narrow the window and use a smoothing function to get a better sense of the trend:
loans07 = loans[loans$posted_datetime > strptime("2007-01-01", format="%Y-%m-%d"), ]
ggplot(data=loans07, aes(x=posted_datetime, y=terms.loan_amount)) +
geom_smooth(lwd=2, level=0.99) +
xlab("Time") + ylab("Loan Amount") + theme_bw()
## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.
Total loan amount is growing, as we would expect with a growing business. How do the number of borrowers change over time?
ggplot(data=loans07, aes(x=posted_datetime, y=b.num)) +
geom_smooth(lwd=2, level=0.99) +
xlab("Time") + ylab("Number of borrowers") + theme_bw()
Generally, this value goes up. And what about the gender composition of the borrowers?
ggplot(data=loans07, aes(x=posted_datetime, y=b.female)) +
geom_smooth(lwd=2, level=0.99) +
xlab("Time") + ylab("Fraction of females in borrower group") + theme_bw()
This value climbs for the first year – meaning that, on average, a greater fraction of people in the borrowing group are female – and then dips for the next three years.
How do loan amounts change over time for the top three sectors?
ggplot(data=loans07[loans07$sector %in% top_sectors, ],
aes(x=posted_datetime, y=terms.loan_amount, colour=sector)) +
geom_smooth(lwd=2, level=0.99) + xlab("Time") + ylab("Loan Amount") +
scale_colour_discrete(name="Sector") +
theme_bw()
And finally, we ask: How do loan amounts change over time by country? To avoid cluttering the plot, we can look at only the top five countries, in terms of how many loans they received:
top_countries = sort(table(loans07$location.country_code), decreasing=T)[1:5]
ggplot(data=loans07[loans07$location.country_code %in% names(top_countries), ],
aes(x=posted_datetime, y=terms.loan_amount, colour=location.country_code)) +
geom_smooth(lwd=2, level=0.99) + xlab("Time") + ylab("Loan Amount") +
scale_colour_discrete(name="Country") + theme_bw()
Kenya had a spike in 2008 in which they took higher-value loans, while the per-loan value for other countries has been increasingly more stably – Nicaragua and Cambodia, in particular.
Let’s revisit our scatterplot matrix, only now we will add “pipeline features” – that is, the timediffs that represent the difference in days between the different stages in the loan pipeline. We can also add in the number of lenders.
pairs.panels(loans[, c(8, 9, 17, 18, 5, 14, 15, 16)])
Recall that we are separating the funding component of the pipeline from the repayment component. That noted, we observe that the loan amount and the funding time are correlated, albeit weakly; this is not surprising. What is somewhat more surprising is the correlation (again, weak) between funding time and gender. This correlation is negative, meaning that as the fraction of females in a borrowing group increases, the funding time tends to drop. There are similar trends, though even weaker, between the same variables and the repayment time.
The number of borrowers and the length of the description have no apparent effect on how quickly a loan is funded (or repaid). Lender count, meanwhile, is correlated with variables that one might expect: the loan amount, the number of borrowers, and the funding time. It is not correlated with the gender composition of the borrowing group. We can quickly see, too, that the dollar per lender is completely uncorrelated with gender:
cor(loans$b.female, loans$terms.loan_amount/loans$lender_count)
## [1] -0.01251265
If gender does have an impact on lenders’ behavior, it is not in how much they give, but rather on whether they give – and the more people that do give, the more quickly the loan is funded.
Now we ask how sector and country relate to pipeline variables. We’ll begin with sector and funding time:
ggplot(data=loans, aes(x=sector, y=funded_timediff,
colour=sector)) + geom_boxplot() +
scale_colour_discrete(guide=FALSE) +
xlab("Sector") + ylab("Funding time") +
theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5, size=6))
The median funding time for all sector is quite low, with slightly larger IQR’s for sectors that one might expect: housing and transportation. The length of the each sector’s tail is the main distinguishing feature.
Does the number of lenders vary by sector?
ggplot(data=loans, aes(x=sector, y=lender_count,
colour=sector)) + geom_boxplot() +
scale_colour_discrete(guide=FALSE) +
xlab("Sector") + ylab("Number of lenders") +
theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5, size=6))
Not really. As we saw before, sector and loan amount aren’t very related, and since loan amount is reasonably correlated with the number of lenders, the lack of a relationship here isn’t surprising.
Does sector help explain any variance in repayment time?
ggplot(data=loans, aes(x=sector, y=paid_timediff,
colour=sector)) + geom_boxplot() +
scale_colour_discrete(guide=FALSE) +
xlab("Sector") + ylab("Repayment time") +
theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5, size=6))
Again, the answer is not really.
We move on to examining the relationship of the borrower’s country to the funding time.
ggplot(data=hc_loans[hc_loans$funded_timediff >= 0, ],
aes(x=location.country_code, y=funded_timediff,
colour=location.country_code)) + geom_boxplot() +
scale_colour_discrete(guide=FALSE) +
xlab("Country") + ylab("Funding time") +
theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5, size=6))
There are a few countries whose median value is much higher than the global median, which is a mere 0.71 days. These counties include:
hc_loans_0 = hc_loans[hc_loans$funded_timediff >= 0, ]
long_fund = sort(tapply(hc_loans_0$funded_timediff,
hc_loans_0$location.country_code, median), decreasing=T)[1:10]
countrycode(names(long_fund), origin="iso2c", destination="country.name")
## [1] "Iraq" "Palestine, State of"
## [3] "Lebanon" "Armenia"
## [5] "Colombia" "El Salvador"
## [7] "Azerbaijan" "Bolivia, Plurinational State of"
## [9] "Kyrgyzstan" "Tajikistan"
How does the number of lenders relate to the borrowing country?
ggplot(data=hc_loans, aes(x=location.country_code, y=lender_count,
colour=location.country_code)) + geom_boxplot() +
scale_colour_discrete(guide=FALSE) +
xlab("Country") + ylab("Number of lenders") +
theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5, size=6))
Again, we see a fair amount of variance in the medians. Which countries require the most lenders?
high_lend_ct = sort(tapply(hc_loans_0$lender_count,
hc_loans_0$location.country_code, median), decreasing=T)[1:10]
countrycode(names(high_lend_ct), origin="iso2c", destination="country.name")
## [1] "Iraq" "Armenia" "Palestine, State of"
## [4] "Kyrgyzstan" "Dominican Republic" "Mongolia"
## [7] "Lebanon" "Ukraine" "Costa Rica"
## [10] "Azerbaijan"
Some of these names appeared in the previous list, but others are new. If we look at the median loan amount by country, we can get a sense for why (some of) these countries require more lenders and more funding time:
high_loan_amt = sort(tapply(hc_loans_0$terms.loan_amount,
hc_loans_0$location.country_code, median), decreasing=T)[1:10]
countrycode(names(high_loan_amt), origin="iso2c", destination="country.name")
## [1] "Iraq" "Palestine, State of" "Armenia"
## [4] "Dominican Republic" "Azerbaijan" "Lebanon"
## [7] "Kyrgyzstan" "Mongolia" "Ukraine"
## [10] "Costa Rica"
They have some of the biggest loans. (There are countries that have even larger loans, but fewer loans in number, such as the United States.)
Finally, we examine repayment time by country.
ggplot(data=hc_loans, aes(x=location.country_code, y=paid_timediff,
colour=location.country_code)) + geom_boxplot() +
scale_colour_discrete(guide=FALSE) +
xlab("Country") + ylab("Repayment time") +
theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5, size=6))
Here, the disparities are quite significant. If we wish to predict how long it will take a borrower to pay back a loan, that borrower’s country could certainly be helpful.
How do these new pipeline variables relate to the calendar time? We can check.
ggplot(data=loans07,
aes(x=posted_datetime, y=funded_timediff)) +
geom_smooth(lwd=2, level=0.99) + xlab("Time") + ylab("Funding time") +
theme_bw()
Funding time has been slowly increasing – but not by much. The maximum toward the end of record keeping could stand some investigation.
Do we see a corresponding increase in the repayment time?
ggplot(data=loans07,
aes(x=posted_datetime, y=paid_timediff)) +
geom_smooth(lwd=2, level=0.99) + xlab("Time") + ylab("Repayment time") +
scale_colour_discrete(name="Sector") +
theme_bw()
From mid-2009 on, the answer is yes: repayment time is increasing. However, there is a different unexplained maximum toward the start of recording keeping.
Here is what we have learned:
We also encountered some curiosities in the data that we should address before moving on. Specifically:
head(hc_loans)
## id status sector partner_id lender_count
## 1 334113 paid Food 183 42
## 2 313179 paid Transportation 145 24
## 3 206002 paid Retail 145 15
## 4 354610 paid Education 113 15
## 5 226099 paid Agriculture 109 10
## 6 78960 paid Food 70 85
## description.languages location.country_code terms.loan_amount
## 1 en SL 1175
## 2 en PH 600
## 3 en PH 425
## 4 es, en GT 400
## 5 en KH 500
## 6 en PE 2725
## description.avgchar posted_datetime funded_datetime
## 1 690 2011-09-11 00:40:06 2011-09-29 01:06:26
## 2 524 2011-07-04 15:20:03 2011-07-08 06:18:03
## 3 1637 2010-07-01 07:00:16 2010-07-15 17:11:28
## 4 472 2011-11-05 23:50:06 2011-11-06 02:52:45
## 5 731 2010-10-01 07:00:09 2010-10-02 16:54:39
## 6 1890 2008-12-12 16:30:14 2008-12-16 15:52:58
## paid_datetime disbursal_datetime funded_timediff
## 1 2012-09-15 09:19:24 2011-08-15 07:00:00 18.0183
## 2 2012-01-15 11:01:44 2011-06-13 07:00:00 3.6236
## 3 2010-12-15 15:08:24 2010-05-07 07:00:00 14.4244
## 4 2012-05-15 08:52:45 2011-10-10 07:00:00 0.1685
## 5 2011-05-15 10:07:34 2010-07-28 07:00:00 1.4128
## 6 2009-06-15 22:51:05 2008-11-30 08:00:00 3.9741
## disbursed_timediff paid_timediff b.num b.female
## 1 -44.7545 397.0968 1 0
## 2 -24.9709 216.2095 1 1
## 3 -69.4246 222.3808 1 1
## 4 -26.8700 218.0783 1 1
## 5 -66.4130 291.1303 1 1
## 6 -16.3284 197.5771 13 1
hc_loans2 = hc_loans[hc_loans$funding_timediff >= 0, ]
hc_loans3 = hc_loans2[hc_loans$terms.loan_amount < 6000, ]
hc_loans4 = hc_loans3[hc_loans$posted_datetime > strptime("2007-01-01", format="%Y-%m-%d"), ]
hc_loans4_json = jsonlite::toJSON(hc_loans4)
write(hc_loans4_json, file="hc_loans4.json")